Piecewise Linear Regression for Massive Data through Dataspheres

Authors

  • Tamraparni Dasu
  • Theodore Johnson
Abstract

We propose enhancing the fitting of linear regression models to massive multidimensional data by partitioning the data using a DataSphere (proposed in our previous work) and fitting piecewise linear regression models to each class in the representation. Nonlinear models typically involve several iterations through the data and require the knowledge of every data point. Linear regression models, on the other hand, can be computed just from second order summaries. Hence piecewise linear models are often used to capture nonlinear relationships using rectangular binning schemes. A disadvantage of rectangular grids is that the number of classes in the partition grows exponentially with the number of dimensions. We propose using the DataSphere partition to divide the data space for fitting piecewise linear regression models, where the number of classes grows linearly in the number of dimensions. The partition is based on distance from a center (distance layers) and direction of maximum variance (directional pyramids). The method exploits the fact that linear regression coefficients can be computed from second order moments that are generated during the creation of the DataSphere representation. An additional advantage is flexibility, since the regression models for a coarser DataSphere representation (fewer distance layers for instance) can be recomputed from the summaries without going back to the raw data.
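The abstract's central observation, that regression coefficients depend on the data only through second order summaries, can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`summarize`, `merge`, `fit`), the synthetic data, and the two-layer split at the median distance from the mean are all assumptions made for illustration, and directional pyramids are omitted. Each class keeps only its count, cross-product matrix, and cross-product vector; coefficients come from the normal equations, and a coarser partition is obtained by adding summaries.

```python
import numpy as np

def summarize(X, y):
    """Second order summaries of one class (hypothetical helper)."""
    Z = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    return {"n": len(y), "ZtZ": Z.T @ Z, "Zty": Z.T @ y}

def merge(a, b):
    """Summaries are additive, so merging two classes needs no raw data."""
    return {key: a[key] + b[key] for key in a}

def fit(summary):
    """Least-squares coefficients from the normal equations."""
    return np.linalg.solve(summary["ZtZ"], summary["Zty"])

# Synthetic data (an assumption): the response follows a different
# linear rule near the center than far from it.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
near = np.linalg.norm(X, axis=1) < 1.5
y = np.where(near, X @ [1.0, 2.0, -1.0], X @ [0.5, 0.0, 3.0] + 2.0)
y = y + rng.normal(scale=0.1, size=1000)

# Two distance layers around the mean, split at the median distance
# (an illustrative choice, not the DataSphere construction itself).
dist = np.linalg.norm(X - X.mean(axis=0), axis=1)
inner = dist <= np.median(dist)

s_inner = summarize(X[inner], y[inner])
s_outer = summarize(X[~inner], y[~inner])

beta_inner = fit(s_inner)                   # one linear piece per layer
beta_outer = fit(s_outer)
beta_coarse = fit(merge(s_inner, s_outer))  # coarser model, summaries only
```

Here `beta_coarse` is computed without revisiting `X` or `y`, which mirrors the flexibility the abstract describes. Solving the normal equations directly is adequate for a well-conditioned sketch like this one; a QR- or SVD-based solver would be a more robust choice when the cross-product matrix is nearly singular.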


Similar resources

Linear Regression Tree for Multistage Manufacturing Process Control

Reconfigured Piecewise Linear Regression Tree for Multistage Manufacturing Process Control. Ran Jin and Jianjun Shi, H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, 765 Ferst Drive NW, Atlanta, GA 30332, USA. Abstract: In a multistage manufacturing process, massive observational data are obtained from the measurem...


An integrated heuristic method based on piecewise regression and cluster analysis for fluctuation data (A case study on health-care: Psoriasis patients)

Trend forecasting and a proper understanding of future changes are necessary for planning in the health-care area. One of the problems of analytic methods is determining the number and location of the breakpoints, especially for fluctuation data. In this area, few studies have been published for the case in which the number and location of the nodes are not specified. In this paper, a clustering-based method is develo...


A New Learning Method for Piecewise Linear Regression

A new connectionist model for the solution of piecewise linear regression problems is introduced; it is able to reconstruct both continuous and non-continuous real-valued mappings starting from a finite set of possibly noisy samples. The approximating function can assume a different linear behavior in each region of an unknown polyhedral partition of the input domain. The proposed learning tech...


General fuzzy piecewise regression analysis with automatic change-point detection

Yu et al. (Fuzzy Sets and Systems 105 (1999) 429) performed general piecewise necessity regression analysis based on linear programming (LP) to obtain the necessity area. Their method is the same as that according to data distribution, even if the data are irregular, practitioners must specify the number and the positions of change-points. However, as the sample size increases, the number of ch...


Hunting Data Glitches in Massive Time Series Data

In a previous paper [5] presented at IQ’99, we had proposed a method for isolating data glitches in massive data sets using a data mining method called DataSpheres. The technique runs in linear time, isolating sections of data that contain corrupted or abnormal data. In this paper, we propose using the DataSphere technique to isolate problems in time series data. We define two types of multivar...




Publication date: 1998